Guangzhou Medical Journal (广州医药) ›› 2026, Vol. 57 ›› Issue (1): 70-76. DOI: 10.20223/j.cnki.1000-8535.2026.01.010

• Original Article •

Evaluation of large language models' diagnostic performance based on ultrasound and mammography reports and images

LYU Jiayi1,2, TONG Wenjuan3, LIN Xinxin3, LIN Yadan2, WANG Wei3, GUO Yuan4, YANG Hong1,2   

  1 Collaborative Innovation Centre of Regenerative Medicine and Medical BioResource Development and Application Co-constructed by the Province and Ministry, Guangxi Medical University, Nanning 530021, China;
  2 Department of Medical Ultrasound, the First Affiliated Hospital of Guangxi Medical University, Nanning 530021, China;
  3 Department of Medical Ultrasonics, Ultrasomics Artificial Intelligence X-Lab, Institute of Diagnostic and Interventional Ultrasound, The First Affiliated Hospital of Sun Yat-Sen University, Guangzhou 510000, China;
  4 Department of Radiology, Guangzhou First People's Hospital, the Second Affiliated Hospital, School of Medicine, South China University of Technology, Guangzhou 510180, China
  • Received: 2025-04-13 Published: 2026-02-03

Abstract: Objective To evaluate ChatGPT 4 and a fine-tuned Llama 3 model for breast cancer diagnosis, specifically on unstructured reports and on images from ultrasound, mammography, and the two combined. Methods Data from 689 patients who underwent both breast ultrasound and mammography were collected retrospectively. The diagnostic performance of the two models was compared in the text and image modalities, and the effect of breast density on model performance was examined. Results In the text modality, the fine-tuned Llama 3 model performed well, reaching an accuracy of 91.7% for combined (ultrasound plus mammography) diagnosis versus 71.7% for ChatGPT 4. In the image modality, both models had accuracies below 70%; ChatGPT 4 showed the higher sensitivity (78.3%), while Llama 3 showed notably high specificity (98.3%). Subgroup analysis indicated that mammography performed better in non-dense breasts, whereas ultrasound was more advantageous in dense breasts. Conclusions Large language models still need further optimization for medical image processing and multimodal integration, but large language models fine-tuned for the medical domain show potential for handling unstructured clinical text.

Key words: large language model, breast cancer, ultrasound, mammography
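
The accuracy, sensitivity, and specificity figures quoted in the abstract are presumably the standard confusion-matrix metrics; the abstract does not spell out the definitions, so that is an assumption. A minimal Python sketch of those definitions follows. The `ConfusionCounts` class, the `diagnostic_metrics` function, and the counts are hypothetical and chosen only for illustration; they are not data from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConfusionCounts:
    """Binary confusion-matrix counts (malignant = positive class)."""
    tp: int  # malignant cases classified as malignant
    fn: int  # malignant cases classified as benign
    tn: int  # benign cases classified as benign
    fp: int  # benign cases classified as malignant


def diagnostic_metrics(c: ConfusionCounts) -> dict[str, float]:
    """Return accuracy, sensitivity, and specificity for the given counts."""
    total = c.tp + c.fn + c.tn + c.fp
    return {
        "accuracy": (c.tp + c.tn) / total,
        "sensitivity": c.tp / (c.tp + c.fn),  # true positive rate
        "specificity": c.tn / (c.tn + c.fp),  # true negative rate
    }


# Hypothetical counts for illustration only; NOT taken from the study.
example = ConfusionCounts(tp=40, fn=10, tn=45, fp=5)
metrics = diagnostic_metrics(example)
print({name: round(value, 3) for name, value in metrics.items()})
```

Under these definitions a model can pair high specificity with modest accuracy, as reported for Llama 3 in the image modality, if it misses many malignant cases.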